ABSTRACT


Automatic extraction of semantic information from text and 
links in Web pages is key to improving the quality of search 
results. However, the assessment of automatic semantic 
measures is limited by the coverage of user studies, which 
do not scale with the size, heterogeneity, and growth of the 
Web. Here we propose to leverage human-generated meta- 
data — namely topical directories — to measure semantic 
relationships among massive numbers of pairs of Web pages 
or topics. The Open Directory Project classifies millions of 
URLs in a topical ontology, providing a rich source from 
which semantic relationships between Web pages can be de- 
rived. While semantic similarity measures based on tax- 
onomies (trees) are well studied, the design of well-founded 
similarity measures for objects stored in the nodes of arbi- 
trary ontologies (graphs) is an open problem. This paper de- 
fines an information-theoretic measure of semantic similar- 
ity that exploits both the hierarchical and non-hierarchical 
structure of an ontology. An experimental study shows 
that this measure improves significantly on the traditional 
taxonomy-based approach. This novel measure allows us to 
address the general question of how text and link analyses 
can be combined to derive measures of relevance that are in 
good agreement with semantic similarity. Surprisingly, the 
traditional use of text similarity turns out to be ineffective 
for relevance ranking. 


Categories and Subject Descriptors 


H.3.1 [Information Storage and Retrieval]: Content 
Analysis and Indexing; H.3.3 [Information Storage and 
Retrieval]: Information Search and Retrieval; H.3.4 [In- 
formation Storage and Retrieval]: Systems and Soft- 
ware—Performance evaluation (effectiveness) 
1. INTRODUCTION


Developing Web search mechanisms depends on address- 
ing two central questions: (1) how to find related Web pages, 
and (2) given a set of potentially related Web pages, how 
to rank them according to relevance. To evaluate the ef- 
fectiveness of a Web search mechanism in finding and rank- 
ing results, measures of semantic similarity are needed. In 
traditional approaches users provide manual assessments of 
relevance, or semantic similarity. This is difficult and ex- 
pensive. More importantly, it does not scale with the size, 
heterogeneity, and growth of the Web — subjects can evalu- 
ate sets of queries, but cannot cover exhaustively all topics. 

The Open Directory Project (ODP) is a large human-edited
directory of the Web, employed by hundreds of portals and
search sites including Google. The ODP classifies
millions of URLs in a topical ontology. Ontologies help to 
make sense out of a set of objects. Once the meaning of a set 
of objects is available, it can be usefully exploited to derive 
semantic relationships between those objects. Therefore, the 
ODP provides a rich source from which measurements of se- 
mantic similarity between Web pages can be obtained. 

An ontology is a special kind of network. The problem 
of evaluating semantic similarity in a network has a long 
history in psychological theory [22]. More recently, semantic 
similarity became fundamental in knowledge representation 
where special kinds of networks or ontologies are used to 
describe objects and their relationships [6]. 

Many proposals estimate semantic similarity in a network 
representation by computing distance between the nodes. 
These frameworks are based on the premise that the closer 
the semantic relationship of two objects, the closer they 
will be in the network representation. However, as several
authors have discussed, issues arise when applying
distance-based schemes to measure object similarities in
certain classes of networks where links may not represent
uniform distances [19].

In ontologies, certain links connect very dense and general 
categories while others connect more specific ones. To ad- 
dress this problem, some proposals estimate semantic sim- 
ilarity in a taxonomy based on the notion of information 
content [19, 12]. In these approaches, the semantic simi- 
larity between two objects is related to their commonality 
and to their differences. Given a set of objects in an “is-a” 
taxonomy, the commonality of two objects can be estimated 
by the extent to which they share information, indicated by 
the most specific class in the hierarchy that subsumes both. 
The meaning of the individual objects can be measured by 
looking at the classes rooted at each of the topics. 

Ontologies are often equated with “is-a” taxonomies, but 
ontologies need not be limited to these forms. For exam- 
ple, the ODP ontology is more complex than a simple tree. 
Some categories have multiple criteria to classify subcate- 
gories. The “Business” category, for instance, is subdivided 
by types of organizations (cooperatives, small businesses, 
major companies, etc.) as well as by areas (automotive, 
health care, telecom, etc.). Furthermore, the ODP has vari- 
ous types of cross-reference links between categories, so that 
a node may have multiple parent nodes, and even cycles are 
present. 

While semantic similarity measures based on trees are well 
studied [5], the design of well-founded similarity measures 
for objects stored in the nodes of arbitrary graphs is an open 
problem. A few empirical measures have been proposed, for 
example based on minimum cut/maximum flow algorithms 
[13], but no information-theoretic measure is known. The 
central question addressed in this paper is how to estimate 
semantic similarity in generalized ontologies, such as the 
ODP graph, taking advantage of both their hierarchical (“is- 
a” links) and non-hierarchical (cross links) components. 


1.1 Contributions and Outline 


In the next section we introduce a novel graph-based mea- 
sure of semantic similarity. To the best of our knowledge 
this is the first information-theoretic measure of similarity 
that is applicable to objects stored in the nodes of arbitrary 
graphs, in particular topical ontologies and Web directories 
that combine hierarchical and non-hierarchical components 
such as Yahoo!, ODP and their derivatives. 

Section 3 compares the graph-based semantic similarity 
measure to the tree-based one, analyzing the differences be- 
tween the two measurements and presenting an evaluation 
against human judgments of Web page similarity. We show
that the new measure predicts human responses with much
greater accuracy.

Having validated the proposed semantic similarity mea- 
sure, in Section 4 we begin to explore the question of appli- 
cations, namely how text and link analyses can be combined 
to derive measures of relevance that are in good agreement 
with semantic similarity. We consider various combinations 
of text and link similarity and discuss how these correlate 
with semantic similarity and how well they rank pages. We
find that, surprisingly, classic text-based content similarity
is a very noisy feature, at best weakly correlated with
semantic similarity. We discuss the potential applications of
this result to the design of semantic similarity estimates from
lexical and link similarity, and to the optimization of ranking
functions in search engines.


2. SEMANTIC SIMILARITY 
2.1 Tree-Based Similarity 


Lin [12] has investigated an information theoretic defini- 
tion of similarity that is applicable as long as the domain 
has a probabilistic model. This proposal can be used to de- 
rive a measure of semantic similarity between topics in an 
“is-a” taxonomy. 

According to Lin’s proposal, the semantic similarity be- 
tween two topics in a taxonomy is defined as a function of 
the meaning shared by the topics and the meaning of each 
of the individual topics. In a taxonomy, the meaning shared 
by two topics can be recognized by looking at the lowest 
common ancestor, which corresponds to the most specific 
common classification of the two topics. Once this common 
classification is identified, the meaning shared by two top- 
ics can be measured by the amount of information needed 
to state the commonality of the two topics. Likewise, the 
meaning of each of the individual topics is measured by the 
amount of information needed to fully describe each of the 
two topics. 

In information theory [3], the information content of a
class or topic t is measured by the negative log likelihood,
$-\log \Pr[t]$. The semantic similarity between two topics $t_1$
and $t_2$ in a taxonomy is then measured as the ratio between
their common meaning and their individual meanings as
follows:

\[
\sigma_s^T(t_1, t_2) = \frac{2 \cdot \log \Pr[t_0(t_1, t_2)]}{\log \Pr[t_1] + \log \Pr[t_2]}
\]


where $t_0(t_1, t_2)$ is the lowest common ancestor topic for $t_1$
and $t_2$ in the tree, and Pr[t] represents the prior probability
that any page is classified under topic t. Given a document
d classified in a topic taxonomy, we use t(d) to refer to the
topic node containing d. Given two documents $d_1$ and $d_2$ in
a topic taxonomy the semantic similarity between them is
estimated as $\sigma_s^T(t(d_1), t(d_2))$. To simplify notation, we use
$\sigma_s^T(d_1, d_2)$ as a shorthand for $\sigma_s^T(t(d_1), t(d_2))$. From here
on, we will refer to measure $\sigma_s^T$ as the tree-based semantic
similarity. The tree-based semantic similarity measure for a
simple taxonomy is illustrated in Figure 1. In this example,
documents $d_1$ and $d_2$ are contained in topics $t_1$ and $t_2$
respectively, while topic $t_0$ is their lowest common ancestor.
In practice Pr[t] can be computed offline for every topic t
in the ODP by counting the fraction of pages stored in the
subtree rooted at node t (subtree(t)), out of all the pages in
the tree.
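
The computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the ODP: the toy taxonomy, the page counts, and all function names are hypothetical, and Pr[t] is obtained by counting pages in the subtree rooted at t.

```python
import math

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {"t0": None, "t1": "t0", "t2": "t0", "t3": "t1"}
# Hypothetical number of pages stored directly in each topic node.
PAGES = {"t0": 2, "t1": 3, "t2": 4, "t3": 1}


def ancestors(topic):
    """Return the path from `topic` up to the root, starting with `topic`."""
    path = []
    while topic is not None:
        path.append(topic)
        topic = PARENT[topic]
    return path


def prior(topic):
    """Pr[t]: fraction of all pages stored in the subtree rooted at `topic`."""
    in_subtree = sum(n for t, n in PAGES.items() if topic in ancestors(t))
    return in_subtree / sum(PAGES.values())


def lowest_common_ancestor(t1, t2):
    """Most specific topic that subsumes both t1 and t2."""
    above_t2 = set(ancestors(t2))
    return next(t for t in ancestors(t1) if t in above_t2)


def tree_similarity(t1, t2):
    """Tree-based similarity: 2 log Pr[t0] / (log Pr[t1] + log Pr[t2])."""
    denom = math.log(prior(t1)) + math.log(prior(t2))
    if denom == 0.0:  # both topics subsume every page (e.g. both are the root)
        return 1.0
    return 2.0 * math.log(prior(lowest_common_ancestor(t1, t2))) / denom
```

In this toy setting, two topics whose only common ancestor is the root have similarity 0, because the root has prior probability 1 and therefore zero information content.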

This measure of semantic similarity has several desirable 
properties and a solid theoretical justification. It is a straight- 
forward extension of the information-theoretic similarity mea- 
sure [12], designed to compensate for the fact that the tree 
can be unbalanced both in terms of its topology and of the 
relative size of its nodes. For a perfectly balanced tree $\sigma_s^T$
corresponds to the familiar tree distance measure [10].

In prior work [14, 15, 16] we computed the $\sigma_s^T$ measure
for all pairs of pages in a stratified sample of about 150,000
pages from across the ODP. For each of the resulting $3.8 \times 10^9$
pairs we also computed text and link similarity measures,
and mapped the correlations between these and semantic
similarity. An interesting result was that these correlations
were quite weak across all pairs, but became significantly
stronger for pages within certain top level categories such
as “news” and “reference.” However, because $\sigma_s^T$ is defined
only in terms of the hierarchical component of the ODP,
it fails to capture many semantic relationships induced by
the ontology’s non-hierarchical components (symbolic and
related links). As a result, the tree-based semantic similarity
between pages in topics that belong to different top-level
categories is zero even if the topics are clearly related. This
yielded an unreliable picture when all topics were considered.


Figure 1: Illustration of tree-based semantic similarity
in a taxonomy.


2.2 Graph-Based Similarity 


Let us now generalize the semantic similarity measure to
deal with arbitrary graphs. We wish to define a graph-based
semantic similarity measure $\sigma_s^G$ that generalizes the
tree-based similarity $\sigma_s^T$ to exploit both the hierarchical and
non-hierarchical components of an ontology.

A topic ontology graph is a graph of nodes representing 
topics. Each node contains objects representing documents 
(pages). An ontology graph has a hierarchical (tree) compo- 
nent made by “is-a” links, and a non-hierarchical component 
made by cross links of different types. 

For example, the ODP ontology is a directed graph G = 
(V, E) where: 


• V is a set of nodes, representing topics containing
  documents;

• E is a set of edges between nodes in V, partitioned
  into three subsets T, S and R, such that:

  - T corresponds to the hierarchical component of
    the ontology,

  - S corresponds to the non-hierarchical component
    made of “symbolic” cross-links,

  - R corresponds to the non-hierarchical component
    made of “related” cross-links.


Figure 2 shows a simple example of an ontology graph G.
This is defined by the sets $V = \{t_1, t_2, t_3, t_4, t_5, t_6, t_7, t_8\}$,
$T = \{(t_1, t_2), (t_1, t_3), (t_1, t_4), (t_3, t_5), (t_3, t_6), (t_6, t_7), (t_6, t_8)\}$,
$S = \{(t_8, t_3)\}$, and $R = \{(t_6, t_2)\}$. In addition, each node
$t \in V$ contains a set of objects. We use |t| to refer to the
number of objects stored in node t (e.g., $|t_3| = 4$).


Figure 2: Illustration of a simple ontology. The legend
distinguishes the three edge types T, S, and R.

The extension of $\sigma_s^T$ to an ontology graph raises two
questions. First, how to find the most specific common ancestor
of a pair of topics in a graph; second, how to extend the
definition of subtree rooted at a topic for the graph case.

An important distinction between taxonomies and ontolo- 
gies such as the ODP graph is that edges in a taxonomy are 
all of the same type (“is-a” links), while in the ODP graph 
edges can have diverse types (e.g., “is-a”, “symbolic”, “re- 
lated”). Different types of edges have different meanings 
and should be used accordingly. One way to distinguish 
the role of different edges is to assign them weights, and to 
vary these weights according to the edge’s type. The weight 
$w_{ij} \in [0,1]$ for an edge between topics $t_i$ and $t_j$ can be
interpreted as an explicit measure of the degree of membership
of $t_j$ in the family of topics rooted at $t_i$. The weight
setting we have adopted for the edges in the ODP graph is
as follows: $w_{ij} = \alpha$ for $(i,j) \in T$, $w_{ij} = \beta$ for $(i,j) \in S$,
and $w_{ij} = \gamma$ for $(i,j) \in R$. We set $\alpha = \beta = 1$ because
symbolic links seem to be treated as first-class taxonomy (“is-a”)
links in the ODP Web interface. Since duplication of URLs
is disallowed, symbolic links are a way to represent multiple
memberships, for example the fact that the pages in topic
“Society/Issues/Fraud/Internet” also belong to topic
“Computers/Internet/Fraud.” On the other hand, we set $\gamma = 0.5$
because related links are treated differently in the ODP Web
interface, labeled as “see also” topics. Intuitively the semantic
relationship is weaker. Different weighting schemes could
be explored.

As a starting point, let $w_{ij} > 0$ if and only if there is
an edge of some type between topics $t_i$ and $t_j$. However,
to estimate topic membership, transitive relations between
edges should also be considered. Let $t_i\downarrow$ be the family of
topics $t_j$ such that either $i = j$ or there is a path $(e_1, \ldots, e_n)$
satisfying:


1. $e_1 = (t_i, t_k)$ for some $t_k \in V$,
2. $e_n = (t_k, t_j)$ for some $t_k \in V$,
3. $e_k \in T \cup S \cup R$ for $k = 1 \ldots n$,
4. $e_k \in S \cup R$ for at most one k.


The above conditions express that $t_j \in t_i\downarrow$ if there is a
directed path in the graph G from $t_i$ to $t_j$, where at most
one edge from S or R participates in the path. The
motivation for disregarding multiple non-hierarchical links in
the transitive relations that determine topic membership is
both practical and conceptual. From a computational
perspective, allowing multiple cross links is infeasible because
it leads to a dense topic membership, i.e., every topic
belongs to almost every other topic. This is also not robust
because a few unreliable cross links make significant global
changes to the membership functions. More importantly,
considering multiple cross links in each path would make
the classification meaningless by mixing all topics together.
Considering at most one cross link in each membership path
allows us to capture the non-hierarchical components of the
ontology while preserving feasibility, robustness, and
meaning. We refer to $t_i\downarrow$ as the cone of topic $t_i$. Because edges
may be associated with different weights, different topics $t_j$
can have different degrees of membership in $t_i\downarrow$.

In order to make the implicit membership relations
explicit, we represent the graph structure by means of
adjacency matrices and apply a number of operations to them.
A matrix $\mathbf{T}$ is used to represent the hierarchical structure of
an ontology. Matrix $\mathbf{T}$ codifies edges in T, augmented with
1s on the diagonal:

\[
T_{ij} =
\begin{cases}
1 & \text{if } i = j, \\
\alpha & \text{if } i \neq j \text{ and } (i,j) \in T, \\
0 & \text{otherwise.}
\end{cases}
\]


We use additional adjacency matrices to represent the
non-hierarchical components of an ontology. For the case
of the ODP graph, a matrix $\mathbf{S}$ is defined so that $S_{ij} = \beta$
if $(i,j) \in S$ and $S_{ij} = 0$ otherwise. A matrix $\mathbf{R}$ is defined
analogously, as $R_{ij} = \gamma$ if $(i,j) \in R$ and $R_{ij} = 0$ otherwise.
Consider the operation $\vee$ on matrices, defined as
$[A \vee B]_{ij} = \max(A_{ij}, B_{ij})$, and let
$\mathbf{G} = \mathbf{T} \vee \mathbf{S} \vee \mathbf{R}$. Matrix $\mathbf{G}$ is the
adjacency matrix of graph G augmented with 1s on the
diagonal.

We will use the MaxProduct fuzzy composition function $\odot$
[8] defined on matrices as follows:²

\[
[A \odot B]_{ij} = \max_k (A_{ik} \cdot B_{kj}).
\]


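Let $\mathbf{T}^{(1)} = \mathbf{T}$ and $\mathbf{T}^{(r+1)} = \mathbf{T}^{(1)} \odot \mathbf{T}^{(r)}$. We define the
closure of $\mathbf{T}$, denoted $\mathbf{T}^+$, as follows:

\[
\mathbf{T}^+ = \lim_{r \to \infty} \mathbf{T}^{(r)}.
\]

In this matrix, $T^+_{ij} = 1$ if $t_j \in \mathrm{subtree}(t_i)$, and $T^+_{ij} = 0$
otherwise. Note that the computation of the closure $\mathbf{T}^+$
converges in a number of steps which is bounded by the
maximum depth of the tree T, is independent of the weight
$\alpha$, and does not involve the weights $\beta$ and $\gamma$.
Finally, we compute the matrix $\mathbf{W}$ as follows:

\[
\mathbf{W} = \mathbf{T}^+ \odot \mathbf{G} \odot \mathbf{T}^+.
\]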


The element $W_{ij}$ can be interpreted as a fuzzy membership
value of topic $t_j$ in the cone $t_i\downarrow$, therefore we refer to $\mathbf{W}$ as
the fuzzy membership matrix of G.


² With our choice of weights, MaxProduct composition is
equivalent to MaxMin composition.




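As an illustration, consider the example ontology in
Figure 2. In this case the matrices $\mathbf{T}$, $\mathbf{G}$, $\mathbf{T}^+$ and $\mathbf{W}$ are defined
as follows:

\[
\mathbf{T} =
\begin{array}{c|cccccccc}
    & t_1 & t_2 & t_3 & t_4 & t_5 & t_6 & t_7 & t_8 \\ \hline
t_1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
t_2 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
t_3 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 \\
t_4 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
t_5 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
t_6 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
t_7 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
t_8 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}
\]

\[
\mathbf{G} =
\begin{array}{c|cccccccc}
    & t_1 & t_2 & t_3 & t_4 & t_5 & t_6 & t_7 & t_8 \\ \hline
t_1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
t_2 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
t_3 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 \\
t_4 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
t_5 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
t_6 & 0 & 0.5 & 0 & 0 & 0 & 1 & 1 & 1 \\
t_7 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
t_8 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1
\end{array}
\]

\[
\mathbf{T}^+ =
\begin{array}{c|cccccccc}
    & t_1 & t_2 & t_3 & t_4 & t_5 & t_6 & t_7 & t_8 \\ \hline
\mathrm{subtree}(t_1) & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
\mathrm{subtree}(t_2) & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
\mathrm{subtree}(t_3) & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 1 \\
\mathrm{subtree}(t_4) & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
\mathrm{subtree}(t_5) & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
\mathrm{subtree}(t_6) & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
\mathrm{subtree}(t_7) & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
\mathrm{subtree}(t_8) & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}
\]

\[
\mathbf{W} =
\begin{array}{c|cccccccc}
    & t_1 & t_2 & t_3 & t_4 & t_5 & t_6 & t_7 & t_8 \\ \hline
t_1\downarrow & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
t_2\downarrow & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
t_3\downarrow & 0 & 0.5 & 1 & 0 & 1 & 1 & 1 & 1 \\
t_4\downarrow & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
t_5\downarrow & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
t_6\downarrow & 0 & 0.5 & 1 & 0 & 1 & 1 & 1 & 1 \\
t_7\downarrow & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
t_8\downarrow & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 1
\end{array}
\]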
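
The matrix construction for the Figure 2 example can be reproduced with a short Python sketch. It is only an illustration of the definitions given in the text (element-wise maximum, MaxProduct composition, closure of the tree matrix); the function names and the dense list-of-lists representation are assumptions of this sketch and would not scale to a graph the size of the ODP.

```python
ALPHA, BETA, GAMMA = 1.0, 1.0, 0.5  # edge weights adopted in the text
N = 8  # topics t1..t8 of Figure 2 (stored 0-indexed)

T_EDGES = [(1, 2), (1, 3), (1, 4), (3, 5), (3, 6), (6, 7), (6, 8)]
S_EDGES = [(8, 3)]
R_EDGES = [(6, 2)]


def adjacency(edges, weight, diagonal=0.0):
    """Weighted adjacency matrix for 1-indexed edges (i, j)."""
    m = [[diagonal if i == j else 0.0 for j in range(N)] for i in range(N)]
    for i, j in edges:
        m[i - 1][j - 1] = weight
    return m


def elementwise_max(a, b):
    """The operation written as A v B in the text."""
    return [[max(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def max_product(a, b):
    """MaxProduct fuzzy composition: max over k of A[i][k] * B[k][j]."""
    return [[max(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


def closure(t):
    """Compose T with itself until a fixed point is reached."""
    current = t
    while True:
        nxt = max_product(t, current)
        if nxt == current:
            return current
        current = nxt


T = adjacency(T_EDGES, ALPHA, diagonal=1.0)
G = elementwise_max(T, elementwise_max(adjacency(S_EDGES, BETA),
                                       adjacency(R_EDGES, GAMMA)))
T_PLUS = closure(T)
W = max_product(max_product(T_PLUS, G), T_PLUS)
```

Because the closure of the tree matrix appears on both sides of G, every membership path uses at most one entry contributed by a cross link, which matches condition 4 in the definition of the cone.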


The semantic similarity between two topics $t_1$ and $t_2$ in
an ontology graph can now be estimated as follows:

\[
\sigma_s^G(t_1, t_2) = \max_k \frac{2 \cdot \min(W_{k1}, W_{k2}) \cdot \log \Pr[t_k]}{\log(\Pr[t_1|t_k] \cdot \Pr[t_k]) + \log(\Pr[t_2|t_k] \cdot \Pr[t_k])}
\]

The probability $\Pr[t_k]$ represents the prior probability that
any document is classified under topic $t_k$ and is computed
as:

\[
\Pr[t_k] = \frac{\sum_{t_j \in V} (W_{kj} \cdot |t_j|)}{|U|},
\]

where |U| is the number of documents in the ontology. The
posterior probability $\Pr[t_i|t_k]$ represents the probability that
any document will be classified under topic $t_i$ given that it
is classified under $t_k$, and is computed as follows:

\[
\Pr[t_i|t_k] = \frac{\sum_{t_j \in V} (\min(W_{ij}, W_{kj}) \cdot |t_j|)}{\sum_{t_j \in V} (W_{kj} \cdot |t_j|)}
\]


The proposed definition of $\sigma_s^G$ is a generalization of $\sigma_s^T$.
In the special case when G is a tree (i.e., $S = R = \emptyset$),
then $t_i\downarrow$ is equal to subtree($t_i$), the topic subtree rooted
at $t_i$, and all topics $t \in \mathrm{subtree}(t_i)$ belong to $t_i\downarrow$ with a
degree of membership equal to 1. If $t_k$ is an ancestor of
$t_1$ and $t_2$ in a taxonomy, then $\min(W_{k1}, W_{k2}) = 1$ and
$\Pr[t_i|t_k] \cdot \Pr[t_k] = \Pr[t_i]$ for $i = 1, 2$. In addition, if there are
no cross-links in G, the topic $t_k$ whose index k maximizes
$\sigma_s^G(t_1, t_2)$ corresponds to the lowest common ancestor of $t_1$
and $t_2$.
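
As a rough sketch of how the graph-based measure can be evaluated, the following Python code applies the three formulas above to the fuzzy membership matrix of the Figure 2 example. The node sizes are hypothetical (only $|t_3| = 4$ is stated in the text), the function names are assumptions of this sketch, and every node is assumed to hold at least one document so that all logarithms are defined.

```python
import math

# Fuzzy membership matrix W of the Figure 2 example (rows/columns t1..t8).
W = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0.5, 1, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0.5, 1, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 1, 1, 1],
]
# Hypothetical |t_j| values; only |t3| = 4 is given in the text.
SIZE = [1, 2, 4, 3, 2, 1, 2, 5]
TOTAL = sum(SIZE)  # |U|


def prior(k):
    """Pr[t_k]: membership-weighted fraction of documents in the cone of t_k."""
    return sum(w * n for w, n in zip(W[k], SIZE)) / TOTAL


def joint(i, k):
    """Pr[t_i | t_k] * Pr[t_k], i.e. sum_j min(W_ij, W_kj) |t_j| / |U|."""
    return sum(min(wi, wk) * n for wi, wk, n in zip(W[i], W[k], SIZE)) / TOTAL


def graph_similarity(a, b):
    """Graph-based similarity of topics a and b (0-indexed), maximized over k."""
    best = 0.0
    for k in range(len(W)):
        shared = min(W[k][a], W[k][b])
        if shared == 0:
            continue  # t_k does not subsume both topics
        denom = math.log(joint(a, k)) + math.log(joint(b, k))
        if denom == 0.0:  # both topics cover every document
            value = shared
        else:
            value = 2.0 * shared * math.log(prior(k)) / denom
        best = max(best, value)
    return best
```

With these hypothetical sizes, topics $t_2$ and $t_7$ obtain a small positive similarity through the “related” link $(t_6, t_2)$, whereas $t_2$ and $t_4$, which share only the root, remain at 0.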


3. EVALUATION 


The proposed graph-based semantic similarity measure 
was applied to the ODP ontology. The portion of the ODP 
graph we have used for our analysis consists of more than 
half a million topic nodes (only World and Regional categories
were discarded). Computing semantic similarity for each 
pair of nodes in such a huge graph required more than 5,000 
CPU hours on IU’s Analysis and Visualization of Instrument- 
Driven Data (AVIDD) supercomputer facility. The com- 
putational component of AVIDD consists of two clusters, 
each with 208 Prestonia 2.4-GHz processors. The computed
graph-based semantic similarity measurements, in compressed
format, occupy more than 1 TB of IU’s Massive Data
Storage System. After computing the graph-based semantic
similarity, we dynamically computed the less computationally
expensive tree-based semantic similarity on the same ODP 
topic pairs. 


3.1 Analysis of Differences 


The first question to ask of the newly proposed graph-based
semantic similarity definition is whether it produces
different measurements from the traditional tree-based
similarity. The two measures are moderately correlated (Pearson
coefficient $r_P = 0.51$). To dig deeper, we map in Figure 3
the distributions of similarities. Each $(\sigma_s^T, \sigma_s^G)$ coordinate
encodes how many pairs of pages in the ODP have semantic
similarities falling in the corresponding bin. By definition
$\sigma_s^T$ is a lower bound for $\sigma_s^G$. Significant numbers of pairs
yield $\sigma_s^G > \sigma_s^T$, indicating that the graph-based measure
indeed captures semantic relationships that are missed by the
tree-based measure. The largest difference is hard to observe
in the map because it occurs in the $\sigma_s^T = 0$ bins. Here
there are many pairs in different top-level categories of the
ODP, which are related according to non-hierarchical links.

To better quantify the differences between $\sigma_s^T$ and $\sigma_s^G$,
Figure 3 also shows the average graph-based similarity $\langle \sigma_s^G \rangle$
as a function of $\sigma_s^T$. The relative difference is as large as
20% around $\sigma_s^T = 0.32$. The inset highlights the largest
difference, which occurs for $\sigma_s^T = 0$.


3.2 Validation by User Study 


Knowing that tree-based and graph-based measures give
us quantitatively different estimates of semantic similarity,
we conducted a human-subjects experiment to evaluate the
proposed graph-based measure $\sigma_s^G$. As a baseline for
comparison we used Lin’s tree-based measure $\sigma_s^T$. The goal of
this experiment was to contrast the predictions of the two
semantic similarity measures against human judgments of
Web page relatedness.

Figure 3: Top: 200 x 200 bin histogram showing the
distributions of $1.26 \times 10^{12}$ pairs of pages according
to tree-based vs. graph-based semantic similarity.
Colors encode numbers of pairs on a log scale. Bottom:
Averaging of $\sigma_s^G$ for each $\sigma_s^T$ bin highlights the
difference between the two similarity measurements.


Thirty-eight volunteer subjects were recruited for a
30-minute experiment conducted online. Subjects answered 30
questions about similarity between Web pages. For each
question, they were presented with a target Web page and
two candidate Web pages (see Figure 4). The subjects had
to answer by selecting from the two candidate pages the one
that was more related to the target Web page or by indicating
that neither of the candidate pages was related to the
target. A total of 6 target Web pages randomly selected
from the ODP directory were used for the evaluation. For
each target Web page we presented a series of 5 pairs of
candidate Web pages. To investigate which of the two methods
was a better predictor of human assessments of Web page
similarity, the candidate pages were selected with controlled
differences in their semantic similarity to the target page.
Given a target Web page $p^T$, each pair of candidate pages
$p_1^C$ and $p_2^C$ used in our study satisfied the following two
conditions:


Condition 1: $\sigma_s^G(p_1^C, p^T) > \sigma_s^G(p_2^C, p^T)$

Condition 2: $\sigma_s^T(p_1^C, p^T) < \sigma_s^T(p_2^C, p^T)$

Figure 4: A snapshot of the experiment setup for
our user study. The pages displayed are those of
Table 1.


The use of the above conditions guarantees that for each 
question the two models disagreed on their prediction of 
which of the two candidate pages is more related to the tar- 
get page. The pages in the 30 triplets were chosen at random 
among all the cases satisfying the above conditions. To en- 
sure that the participants made their choice independently 
of the questions already answered, we randomized the order 
of the options. Table 1 shows an example of a triplet of 
pages used in our study, corresponding to the question in 
the snapshot of Figure 4. The users were presented with the 
target and candidate pages only — no information related 
to the topics of the pages was shown to the users. 

The semantic similarity between the target page and each
of the candidate pages in our example, according to the two
measurements is as follows:

\[
\begin{array}{ll}
\sigma_s^T(p_1^C, p^T) = 0.24 & \sigma_s^T(p_2^C, p^T) = 0.50 \\
\sigma_s^G(p_1^C, p^T) = 0.91 & \sigma_s^G(p_2^C, p^T) = 0.70
\end{array}
\]

For this triplet of pages, the tree-based method predicts that
$p_2^C$ is more similar to the target than $p_1^C$
($\sigma_s^T(p_2^C, p^T) > \sigma_s^T(p_1^C, p^T)$). On the other hand, according
to the prediction made by the graph-based method $p_1^C$ should
be preferred over $p_2^C$ ($\sigma_s^G(p_1^C, p^T) > \sigma_s^G(p_2^C, p^T)$).
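
The selection rule and the scoring of answers can be illustrated with a small Python sketch. It assumes, as in the example above, that the graph-based measure favors the first candidate and the tree-based measure the second; the function names and the answer labels "c1", "c2" and "neither" are hypothetical.

```python
def models_disagree(tree_1, tree_2, graph_1, graph_2):
    """True when the graph measure prefers candidate 1 and the tree measure candidate 2."""
    return graph_1 > graph_2 and tree_1 < tree_2


def prediction_accuracy(answers):
    """Percentage of answers that match each model, returned as (tree, graph).

    Each answer is "c1", "c2", or "neither". For triplets selected with
    models_disagree, "c1" agrees with the graph-based prediction and "c2"
    with the tree-based one; "neither" agrees with no model.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    total = len(answers)
    tree_hits = sum(1 for a in answers if a == "c2")
    graph_hits = sum(1 for a in answers if a == "c1")
    return 100.0 * tree_hits / total, 100.0 * graph_hits / total
```

Because "neither" counts against both measures, the two percentages for a subject need not add up to 100.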

To test which of the two methods was a better predictor
of subjects’ judgments of Web page similarity we considered
the selections made by each of the human subjects and
computed the percentage of correct predictions made by the two
methods. Table 2 summarizes the statistical results. This
comparison table shows that the graph-based semantic
similarity measure results in statistically significant
improvements over the tree-based one.³


³ This made it unnecessary to recruit a larger subject pool.
Table 2: Mean, standard deviation, and standard error
of the percentage of correct predictions by tree-based
vs. graph-based semantic similarity, as determined
from the assessments by the N subjects. The
fact that the confidence intervals do not overlap is
equivalent to using a t-test to determine that the
difference in average accuracy is statistically
significant at the 95% confidence level.


Measure        N    MEAN     STDEV    SE      95% C.I.
$\sigma_s^T$   38   5.70%    4.71%    0.76%   (4.2%, 7.2%)
$\sigma_s^G$   38   84.65%   11.19%   1.82%   (81.1%, 88.2%)
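
The standard errors and confidence intervals in Table 2 can be checked with a few lines of Python. The sketch assumes a normal approximation with z = 1.96, which appears to reproduce the reported intervals; the text does not state which critical value was used, and the function names are illustrative.

```python
import math


def standard_error(stdev, n):
    """Standard error of the mean for n subjects."""
    return stdev / math.sqrt(n)


def confidence_interval(mean, stdev, n, z=1.96):
    """Approximate 95% confidence interval (z = 1.96 is an assumption)."""
    half_width = z * standard_error(stdev, n)
    return mean - half_width, mean + half_width


for label, mean, stdev in [("tree", 5.70, 4.71), ("graph", 84.65, 11.19)]:
    low, high = confidence_interval(mean, stdev, 38)
    se = standard_error(stdev, 38)
    print(f"{label}: SE = {se:.2f}%, 95% C.I. = ({low:.1f}%, {high:.1f}%)")
```

Rounded to the precision of the table, the sketch yields the same standard errors and interval bounds as those listed in Table 2.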


4. APPLICATIONS 


Having validated our semantic similarity measure $\sigma_s^G$, let
us now begin to explore its applications to performance
evaluation. Using $\sigma_s^G$ as a surrogate for user assessments of
semantic similarity, we can address the general question of 
how text and link analyses can be combined to derive mea- 
sures of relevance that are in good agreement with semantic 
similarity. An analogous approach has been used in the past 
to evaluate similarity search, but relying on only the hierar- 
chical ODP structure as a proxy for semantic similarity [7, 
16]. 

Let us start by introducing two representative similarity
measures $\sigma_c$ and $\sigma_l$ based on textual content and hyperlinks,
respectively. Each is based on the TF-IDF vector
representation and “cosine similarity” function traditionally used in
information retrieval [20]. For content similarity we use:

\[
\sigma_c(p_1, p_2) = \frac{\vec{p}_1^{\,c} \cdot \vec{p}_2^{\,c}}{\|\vec{p}_1^{\,c}\| \cdot \|\vec{p}_2^{\,c}\|}
\]

where $(p_1, p_2)$ is a pair of Web pages and $\vec{p}_i^{\,c}$ is the TF-IDF
vector representation of $p_i$, based on the terms in the page.
Noise words are eliminated [4] and other words are conflated
using the standard Porter stemmer [18].

For the link similarity measure we define:

\[
\sigma_l(p_1, p_2) = \frac{\vec{p}_1^{\,l} \cdot \vec{p}_2^{\,l}}{\|\vec{p}_1^{\,l}\| \cdot \|\vec{p}_2^{\,l}\|}
\]

where $\vec{p}_i^{\,l}$ is the link frequency-inverse document frequency
(LF-IDF) vector representation of page $p_i$. LF-IDF is
analogous to TF-IDF, except that hyperlinks (URLs) are used
in place of words (terms). A page link vector is composed of
its outlinks, inlinks, and the page’s own URL. Link similarity
is a measure of the local undirected clustering coefficient
between two pages. A high value of $\sigma_l$ indicates that the
two pages belong to a clique of pages. Related measures are
often used in link analysis to identify a community around
a topic. This measure generalizes co-citation [21] and
bibliographic coupling [9], but also considers directed paths of
length $\ell \le 2$ links between pages. Such directed paths are
important because they could be navigated by a user or
crawler. Outlinks were obtained from the pages themselves,
while inlinks were obtained from a search engine.⁴
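
Both measures reduce to a cosine between sparse vectors, which can be sketched in Python as follows. For brevity the link vectors below use unit weights instead of LF-IDF weights, which is a simplifying assumption of this sketch; the function names are hypothetical as well.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_vector(url, outlinks, inlinks):
    """Link vector made of a page's outlinks, inlinks, and its own URL.

    Unit weights are used here; the text weights each URL by LF-IDF.
    """
    return {u: 1.0 for u in set(outlinks) | set(inlinks) | {url}}


# Bibliographic coupling: pages a and b both link to c.
a = link_vector("a", outlinks=["c"], inlinks=[])
b = link_vector("b", outlinks=["c"], inlinks=[])
coupling = cosine(a, b)

# Direct link a -> b: b appears among a's outlinks, a among b's inlinks.
a_direct = link_vector("a", outlinks=["b"], inlinks=[])
b_direct = link_vector("b", outlinks=[], inlinks=["a"])
direct = cosine(a_direct, b_direct)
```

Including the page's own URL in its vector is what lets a direct link between two pages contribute to their similarity, in addition to shared inlinks and outlinks.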

One could of course explore alternative content and link
similarity measures; however, our preliminary experiments
indicate that other commonly used measures such as TF-based
cosine similarity and the Jaccard coefficient do not
qualitatively alter the observations that follow.


⁴ We used the Google Web API (www.google.com/apis/)
with special permission.


Table 1: Example of a triplet used in the evaluation


Page    | URL                                                | Topic
$p^T$   | http://www.muppetsonline.com/                      | Arts/Performing_Arts/Puppetry/Muppets
$p_1^C$ | http://www.theentertainmentbusiness.com/sesame.htm | Arts/Television/Programs/Children’s/Sesame_Street/Characters
$p_2^C$ | http://www.yale.edu/yags/                          | Arts/Performing_Arts/Circus/Juggling/Clubs_and_Organizations/College_Juggling-Clubs

Once text and links were extracted from the $1.12 \times 10^6$
Web pages of the ODP ontology, $\sigma_c \in [0,1]$ and $\sigma_l \in [0,1]$
were computed for each of $1.26 \times 10^{12}$ pairs of pages.
Semantic similarities $\sigma_s^T$ and $\sigma_s^G$ were measured as well. Two
200 x 200 x 200 histograms with coordinates $(\sigma_c, \sigma_l, \sigma_s^T)$ and
$(\sigma_c, \sigma_l, \sigma_s^G)$ were generated to analyze the relationships
between the various similarity measures. We focus on the
latter, graph-based semantic similarity in the following
analysis. The computation of these histograms (and the one
for $(\sigma_s^T, \sigma_s^G)$, cf. Section 3.1) required approximately 4,000
additional CPU hours on the AVIDD facility.


4.1 Combining Content and Link Similarity 


The massive data thus collected allows us to study how
well different automatic similarity measures based on
observable features (content and links) approximate semantic
similarity. We considered a number of simple functions
$f(\sigma_c, \sigma_\ell)$ including:


• various linear combinations
$f = \lambda \sigma_c + (1 - \lambda) \sigma_\ell$ for
$0 \le \lambda \le 1$, of which we report the cases
$\lambda = 0$ ($f = \sigma_\ell$), $\lambda = 0.2$,
$\lambda = 0.8$, and $\lambda = 1$ ($f = \sigma_c$);


• the product $f = \sigma_c \sigma_\ell$;


• the step-linear function $f = \sigma_c H(\sigma_\ell)$, where
$H(\sigma_\ell) = 1$ for $\sigma_\ell > 0$ and 0 otherwise;


and other functions omitted for space considerations. Figure 5
plots the Pearson and Spearman correlations between
$\sigma_s^G$ and these functions, versus a threshold on $\sigma_c$.
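The combination functions listed above are simple enough to write down directly. The sketch below is one possible rendering; the function names are ours, and `step_linear` takes the scaled value and the gating value as explicit arguments so that the caller decides which similarity plays which role.

```python
def linear_combination(sigma_c, sigma_l, lam):
    """f = lam * sigma_c + (1 - lam) * sigma_l, with 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * sigma_c + (1.0 - lam) * sigma_l


def product(sigma_c, sigma_l):
    """f = sigma_c * sigma_l."""
    return sigma_c * sigma_l


def step_linear(value, gate):
    """f = value * H(gate), where H(gate) is 1 for gate > 0 and 0 otherwise."""
    return value if gate > 0 else 0.0


# Hypothetical similarity values for one pair of pages.
sigma_c, sigma_l = 0.4, 0.8
scores = {
    "lam=0": linear_combination(sigma_c, sigma_l, 0.0),
    "lam=0.2": linear_combination(sigma_c, sigma_l, 0.2),
    "lam=1": linear_combination(sigma_c, sigma_l, 1.0),
    "product": product(sigma_c, sigma_l),
    "step": step_linear(sigma_c, sigma_l),
}
```

With lam = 0 the linear combination reduces to link similarity alone, and with lam = 1 to content similarity alone, which are the two extreme cases reported in the plots.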

The Pearson correlation coefficient $r_P$ tells us the degree
to which the values of each function $f(\sigma_c, \sigma_\ell)$
agree with $\sigma_s^G$. We can see that the correlations are
rather weak, $0 < r_P < 0.2$, for all f in the plot when we
consider all page pairs. If we restrict the analysis to pairs
that have content similarity $\sigma_c$ above a minimum
threshold, the correlations can become much stronger. It is
meaningful to use a $\sigma_c$ threshold because in applications
such as search engines, the pages to be ranked are those that
are retrieved from an index based on a match, typically between
pages and a user query or some other model page. It is
interesting to observe that the functions that rely heavily on
content similarity ($f = \lambda \sigma_c + (1 - \lambda) \sigma_\ell$
for high $\lambda$) perform particularly poorly at predicting
semantic similarity. They are at best weakly correlated with
$\sigma_s^G$ unless one applies a very high $\sigma_c$ threshold.
This is rather surprising because, prior to the introduction of
link-based importance measures such as PageRank [1], content was
the sole source of evidence for ranking pages, and content
similarity is still widely seen as a central component of any
ranking algorithm.

The Pearson correlation assumes normally distributed values.
Since the similarity functions defined above have mostly
exponential distributions, it is worth validating the above
results using the Spearman rank order correlation coefficient
$r_S$, which is high if two functions agree on the rankings they
produce irrespective of the actual values. This is reasonable
in our setting because, from a search engine user's perspective,
what matters is the order of the hit pages and not the
values used by the ranking function. The Spearman correlation
data in Figure 5 confirm the above observations, with
even more striking evidence of the noisy nature of content
similarity. One can see a clear separation between the poor
rankings produced by functions that depend linearly on $\sigma_c$
and the relatively good rankings produced by functions that
either do not consider $\sigma_c$ or that scale $\sigma_c$ by $\sigma_\ell$.
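To make the two correlation measures concrete, the sketch below computes Pearson and Spearman coefficients between a candidate function and semantic similarity, restricted to pairs whose content similarity reaches a threshold. It is a plain-Python illustration under our own assumptions: the helper names are hypothetical, ties receive average ranks, and the threshold test uses `>=`. It operates on raw triples rather than on the binned histograms used in the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def correlations_above_threshold(triples, f, threshold):
    """Return (pearson, spearman) between f(sigma_c, sigma_l) and sigma_s
    over the triples whose sigma_c is at least `threshold`."""
    kept = [(f(c, l), s) for c, l, s in triples if c >= threshold]
    f_values = [fv for fv, _ in kept]
    s_values = [sv for _, sv in kept]
    return pearson(f_values, s_values), spearman(f_values, s_values)


# Toy triples (sigma_c, sigma_l, sigma_s); values are hypothetical.
toy = [(0.0, 0.1, 0.2), (0.2, 0.1, 0.1), (0.4, 0.3, 0.5), (0.6, 0.9, 0.9)]
r_p, r_s = correlations_above_threshold(toy, lambda c, l: l, 0.1)
```

Because Spearman depends only on ranks, any monotone transformation of a ranking function leaves it unchanged, which is why it suits the comparison of rankings rather than of raw scores.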

The above analysis highlights an extremely low discrim- 
ination power of lexical similarity. This might suggest a 
filtering role for lexical similarity, in which all pages below 
a small threshold would not be considered while above the 
threshold only link-based measures would be used for the 
sake of ranking. While such a bold strategy must be scru- 
tinized carefully, it could lead to a significant simplification 
of ranking algorithms. 
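The filtering strategy suggested above can be sketched in a few lines: discard candidates whose lexical similarity falls below a small threshold, then order the survivors by link similarity alone. The function name, the tuple layout, and the toy values are assumptions made for this illustration, not a description of any deployed ranking system.

```python
def filter_then_rank(candidates, content_threshold):
    """Rank candidates by link similarity after a lexical filter.

    candidates: list of (page_id, sigma_c, sigma_l) tuples.
    Pages with sigma_c below `content_threshold` are dropped; the rest
    are returned as page ids sorted by decreasing sigma_l.
    """
    kept = [c for c in candidates if c[1] >= content_threshold]
    kept.sort(key=lambda c: c[2], reverse=True)
    return [page_id for page_id, _, _ in kept]


# Hypothetical candidates for a single target page.
candidates = [
    ("a", 0.90, 0.10),
    ("b", 0.02, 0.95),  # filtered out: lexical similarity too low
    ("c", 0.10, 0.60),
    ("d", 0.30, 0.30),
]
ranking = filter_then_rank(candidates, content_threshold=0.05)
```

Note that page "a", which has by far the highest content similarity, ends up last: once a page passes the filter, its lexical score plays no further role.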


Figure 5: Pearson (top) and Spearman (bottom)
correlations between graph-based semantic similarity
$\sigma_s^G$ and different functional combinations of content
and link similarity, applying increasing thresholds
on content similarity.


4.2 Evaluating Ranking Functions


Let us finally illustrate how the proposed semantic similarity
function can be used to automatically evaluate alternative
ranking functions. This makes it possible to mine
through a large number of alternative functions automatically
and cheaply, reserving user studies for the most promising
candidates. We want to compare the quality of a ranking
function to the baseline ranking obtained by the use of
semantic similarity. The sliding ratio score [17, 11] compares
two rankings when graded quality assessments are available.⁵
This measure is defined as the ratio between the
cumulative quality scores of the top-ranked pages according
to two ranking functions. We can generalize the sliding ratio
in the following ways:


• use a page as a target rather than an arbitrary query,
as is done in “query by example” systems;


• use $\sigma_s^G$ as a reference ranking function;


• sum over all pages in an ontology such as the ODP,
each used in turn as a target, thus covering the entire
topical space and eliminating the dependence on
a single target.


⁵ In the common case when just binary relevance assessments
are available, one resorts to precision and recall; the sliding
ratio score is a more sophisticated measure enabled by more
refined semantic similarity data.
Figure 6: Generalized sliding ratio score plots for
different functional combinations of content and link 
similarity. We omit the region N < 10° where GSR 
is constant for all f up to the resolution of our his- 
togram bins. 


Let us thus define a generalized sliding ratio score as follows: 


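$$
\mathrm{GSR}(f, N) =
\frac{\displaystyle\sum_{(i,j):\,\mathrm{rank}_f(i,j)=1}^{N} \sigma_s^G(i,j)}
     {\displaystyle\sum_{(i,j):\,\mathrm{rank}_{\sigma_s^G}(i,j)=1}^{N} \sigma_s^G(i,j)}
$$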


where (i, j) is a pair of pages, f is a ranking function to be
tested, and N is the number of top-ranked pairs considered.
Note that for any f, GSR(f, N) → 1 as N tends to the total
number of pairs. The ideal ranking function is one such
that GSR(f, N) ≈ 1 for low N as well. In simplistic terms,
GSR(f, N) tells us how well a function f ranks the top N
pairs of pages.
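The definition above translates directly into code. The sketch below computes the generalized sliding ratio for a candidate function on a small list of pairs; the function name `gsr`, the triple layout, and the toy values are our own choices for illustration, and ties are broken by input order, which the definition does not specify.

```python
def gsr(triples, f, n):
    """Generalized sliding ratio of ranking function f over the top n pairs.

    triples: list of (sigma_c, sigma_l, sigma_s), one per pair of pages.
    f:       candidate ranking function f(sigma_c, sigma_l).
    n:       number of top-ranked pairs to consider.
    """
    if not 1 <= n <= len(triples):
        raise ValueError("n must be between 1 and the number of pairs")
    # Pairs ordered by the candidate function and by the reference measure.
    by_candidate = sorted(triples, key=lambda t: f(t[0], t[1]), reverse=True)
    by_reference = sorted(triples, key=lambda t: t[2], reverse=True)
    numerator = sum(t[2] for t in by_candidate[:n])
    denominator = sum(t[2] for t in by_reference[:n])
    if denominator == 0:
        raise ValueError("reference similarity sums to zero over the top n pairs")
    return numerator / denominator


# Toy pairs (sigma_c, sigma_l, sigma_s); values are hypothetical.
toy = [(0.9, 0.1, 0.2), (0.1, 0.8, 0.9), (0.5, 0.5, 0.6)]
gsr_content_top1 = gsr(toy, lambda c, l: c, 1)
gsr_link_top1 = gsr(toy, lambda c, l: l, 1)
gsr_content_all = gsr(toy, lambda c, l: c, len(toy))
```

In this toy example, ranking by link similarity places the semantically closest pair first and scores 1 at N = 1, while ranking by content similarity does not; at N equal to the number of pairs both numerator and denominator sum over everything, so the score is 1 for any f.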

The generalized sliding ratio score can be readily measured
on our ODP data for any $f(\sigma_c, \sigma_\ell)$. Only pairs with
$\sigma_c > 0$ are considered, since typically in a search engine only
pages matching the query are retrieved. In Figure 6 we plot
GSR(f, N) versus N for the simple combination functions
$f(\sigma_c, \sigma_\ell)$ introduced in Section 4.1. Consistent with the
correlation results, the functions that depend heavily on
content similarity rank poorly. Again, this is only an illustration
of how the $\sigma_s^G$ measure can be applied to the evaluation of
arbitrary ranking functions.


5. DISCUSSION 


In this paper we introduced a novel measure of semantic 
similarity for Web pages that generalizes the well-founded 
information-theoretic tree-based semantic similarity measure 
to the general case in which pages are classified in the nodes 
of an arbitrary graph ontology with both hierarchical and 
non-hierarchical components. This measure can be readily 
applied to mine semantic data from topical ontologies and 
Web directories such as Yahoo!, the ODP and their deriva- 
tives. 

Similarity is commonly viewed as an example of a relation
satisfying the following three conditions:


• Maximality: $\sigma(a,b) \le \sigma(a,a) = 1$.
• Symmetry: $\sigma(a,b) = \sigma(b,a)$.
• Triangular Inequality: $\sigma(a,b) \cdot \sigma(b,c) \le \sigma(a,c)$.


These conditions are adaptations of the minimality, symmetry
and triangle inequality axioms of metric distance functions.
The definition of $\sigma_s^G$ proposed in this paper satisfies
maximality and symmetry but not the triangular inequality
condition. With sufficient computational resources, a
new measure of semantic similarity satisfying the triangular
inequality principle can be computed by applying an adaptation
of Dijkstra’s shortest path algorithm [2] to $\sigma_s^G$:


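$$
\begin{aligned}
\sigma^{(1)}(i,j) &= \sigma_s^G(i,j) \\
\sigma^{(r+1)}(i,j) &= \max\left(\sigma^{(r)}(i,j),\; \max_k \left(\sigma^{(r)}(i,k) \cdot \sigma^{(r)}(k,j)\right)\right) \\
\sigma(i,j) &= \lim_{r \to \infty} \sigma^{(r)}(i,j)
\end{aligned}
$$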
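A small worked example may help to see what the recurrence does. The sketch below iterates the update on a dense matrix until no entry changes and then checks the triangular inequality. It is a direct fixed-point iteration of the recurrence rather than the Dijkstra-style adaptation mentioned in the text, and the function names and the 3 × 3 toy matrix are assumptions made for this illustration.

```python
def max_product_closure(sim, max_rounds=1000, tol=1e-12):
    """Iterate s(i,j) <- max(s(i,j), max_k s(i,k) * s(k,j)) to a fixed point.

    sim: square list-of-lists with entries in [0, 1].
    """
    n = len(sim)
    current = [row[:] for row in sim]
    for _ in range(max_rounds):
        updated = [
            [
                max(
                    current[i][j],
                    max(current[i][k] * current[k][j] for k in range(n)),
                )
                for j in range(n)
            ]
            for i in range(n)
        ]
        changed = any(
            abs(updated[i][j] - current[i][j]) > tol
            for i in range(n)
            for j in range(n)
        )
        current = updated
        if not changed:
            return current
    raise RuntimeError("closure did not converge within max_rounds")


def satisfies_triangular_inequality(sim, tol=1e-12):
    """Check s(i,k) * s(k,j) <= s(i,j) for all i, j, k."""
    n = len(sim)
    return all(
        sim[i][k] * sim[k][j] <= sim[i][j] + tol
        for i in range(n)
        for j in range(n)
        for k in range(n)
    )


# Hypothetical similarities among three items a, b, c:
# a~b and b~c are similar, a~c are not similar at all.
toy = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.7],
    [0.0, 0.7, 1.0],
]
closed = max_product_closure(toy)
```

On this toy matrix the original similarities violate the triangular inequality through b, and the closure raises the a–c entry to 0.8 × 0.7 = 0.56, the smallest value that restores it. Since all entries lie in [0, 1], the iteration can only raise entries up to the best product along some path, so it settles after finitely many rounds.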


While in many cases the lower limit imposed by the trian- 
gular inequality appears to be intuitive, many authors have 
argued against it. Tversky [22] illustrates this position with 
an example about the similarity between countries: “Ja- 
maica is similar to Cuba (because of geographical proximity); 
Cuba is similar to Russia (because of their political affinity); 
but Jamaica and Russia are not similar at all.” This exam- 
ple fits the case of Web pages and their topics, suggesting 
that the triangular inequality should not be accepted as a 
cornerstone of similarity models. 

Computing the graph-based semantic similarity measure 
is a computationally expensive task, both in terms of space 
and time. While matrices T, G, T* and W are sparse and 
easy to store, codifying the graph-based semantic similarity 
measure $\sigma_s^G$ for the ODP topics required the use of 5,712
dense matrices, each one of size 571,148 × 100. The time
complexity for computing the semantic similarity for n topics
is $O(n^3)$ in the worst case; the actual complexity depends
on the density of the W matrix. Some of the techniques
adopted to deal with the time complexity of the problem in- 
clude indexing the sparse structure of the matrices for fast 
access and using a software vector register to compute the 
MaxProduct fuzzy composition function efficiently. Our ap- 
proach may not scale easily to ontologies much larger than 
the ODP graph as it is today. However, approximations 
of $\sigma_s^G$ may be computed in reasonable time if appropriate
heuristics are applied (e.g., via use of thresholds). 
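As one illustration of how sparsity and thresholds can reduce the cost, the sketch below implements a max-product composition over sparse matrices stored as nested dictionaries, discarding products that fall below a cutoff. It uses the standard fuzzy-set reading of max-product composition, (A ∘ B)(i, j) = max_k A(i, k) · B(k, j); the data layout, the function name, and the pruning rule are our assumptions and do not reproduce the indexing or vector-register techniques mentioned above.

```python
def max_product_compose(a, b, threshold=0.0):
    """Sparse max-product composition with optional pruning.

    a, b: sparse matrices as {row: {col: value}}, values in (0, 1];
          missing entries are treated as 0.
    Returns {row: {col: max_k a[row][k] * b[k][col]}}, keeping only
    entries whose value is at least `threshold`.
    """
    result = {}
    for i, row in a.items():
        out = {}
        for k, a_ik in row.items():
            # Only the non-zero entries of row k in b can contribute.
            for j, b_kj in b.get(k, {}).items():
                value = a_ik * b_kj
                if value >= threshold and value > out.get(j, 0.0):
                    out[j] = value
        if out:
            result[i] = out
    return result


# Hypothetical sparse matrices over a handful of topic labels.
a = {"x": {"y": 0.5, "z": 1.0}}
b = {"y": {"w": 0.8}, "z": {"w": 0.3, "v": 0.01}}
exact = max_product_compose(a, b)
pruned = max_product_compose(a, b, threshold=0.05)
```

The work done is proportional to the number of non-zero (i, k, j) combinations rather than to n³, and raising the threshold trades accuracy on weak relationships for a sparser result.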

We have shown that the proposed semantic similarity mea- 
sure predicts human judgments of relatedness with signifi- 
cantly greater accuracy than the tree-based measure. Fi- 
nally we have undertaken a massive data mining effort on 
ODP data in order to begin to explore how text and link 
analyses can be combined to derive measures of relevance in 
agreement with semantic similarity. 

The methodology described here to evaluate ranking algo- 
rithms based on semantic similarity can be applied to arbi- 
trary combinations of ranking functions stemming from text 
analysis (e.g. LSA, query expansion, tag weighting, etc.), 
link analysis (e.g. authority, PageRank, SiteRank, etc.), and 
any other features available to a search engine (e.g. fresh- 
ness, click-through rate, etc.). Yet the applications of the 
proposed semantic similarity measure are broader than just 
Web search. Classification, clustering and resource discov- 
ery also rely on semantic mining of features that can be 
extracted automatically. 
The main, surprising result of our initial analysis with the
graph-based semantic similarity is that the classic text-based
TF-IDF cosine similarity is an extremely noisy feature, unfit
for ranking Web pages. While it seems helpful to filter
out pages with very low lexical similarity ($\sigma_c < 0.05$),
text-based measures do not seem to help in ranking the remaining
pages. On the contrary they are very poorly correlated with 
semantic similarity, possibly reflecting the extent to which 
ambiguous terms mislead the search process. While this re- 
sult helps to explain why early search engines did so poorly 
and validates the use of link-based measures such as PageR- 
ank, the seemingly unredeemed quality of content similarity 
is unexpected. This result calls for revisiting the
role of content similarity in ranking Web results.

We are currently exploring alternative ways to approxi- 
mate semantic similarity by integrating (rather than com- 
bining) content and link similarity. The correlation plots in 
Figure 5 suggest that content may play a positive role in 
filtering hits, if not in ranking them. 

In future work the semantic similarity measure must be 
further validated through user studies. The study presented 
here focuses on cases where $\sigma_s^T$ and $\sigma_s^G$ disagree, and thus it
tells us that $\sigma_s^G$ is more accurate than $\sigma_s^T$ but is too biased
to satisfactorily answer the broader question of how well
$\sigma_s^G$ predicts assessments of semantic similarity by human
subjects in general. It is possible that alternative weighting 
schemes for the different types of links in the ODP ontology 
may lead to measures with improved accuracy. 

The evaluations outlined here have focused on purely local 
text and link analysis. For example, we have not looked at 
the role of more global link and text analysis techniques such 
as PageRank and latent semantic analysis (LSA) in improv- 
ing the quality of ranking by favoring authoritative pages or 
improving content similarity. These are also directions for 
future work. 

Due to the growing number of emerging Web search tech- 
niques and the scale of the Web, automatic evaluation mech- 
anisms are crucial. In the light of the availability of rich se- 
mantic information sources, like the ODP ontology, we have 
proposed a reliable method for the algorithmic detection 
of semantic similarity between Web pages. The proposed 
approach will provide insight for better understanding the 
limitations of existing search techniques and inspire the de- 
velopment of new and more powerful Web search tools. 